Abstract

This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them requires a great deal of technical effort and significant modifications to the standard, more intuitive algorithms that come only with guarantees that hold in expectation. One of these modifications is forcing the learner to sample arms from the uniform distribution at least $\Omega(\sqrt{T})$ times over $T$ rounds, which can adversely affect performance if many of the arms are suboptimal. While it is widely conjectured that this property is essential for proving high-probability regret bounds, we show in this paper that it is possible to achieve such strong results without this undesirable exploration component. Our result relies on a simple and intuitive loss-estimation strategy called Implicit eXploration (IX) that allows a remarkably clean analysis. To demonstrate the flexibility of our technique, we derive several improved high-probability bounds for various extensions of the standard multi-armed bandit framework. Finally, we conduct a simple experiment that illustrates the robustness of our implicit exploration technique.

1 Introduction 

Consider the problem of regret minimization in non-stochastic multi-armed bandits, as defined in the classic paper of Auer, Cesa-Bianchi, Freund, and Schapire [5]. This sequential decision-making problem can be formalized as a repeated game between a learner and an environment (sometimes called the adversary). In each round $t = 1, 2, \dots, T$, the two players interact as follows: the learner picks an arm (also called an action) $I_t \in [K] = \{1, 2, \dots, K\}$ and the environment selects a loss function $\ell_t : [K] \to [0,1]$, where the loss associated with arm $i \in [K]$ is denoted as $\ell_{t,i}$. Subsequently, the learner incurs and observes the loss $\ell_{t,I_t}$. Based solely on these observations, the goal of the learner is to choose its actions so as to accumulate as little loss as possible during the course of the game. As is traditional in the online learning literature, we measure the performance of the learner in terms of the regret defined as

$$R_T = \sum_{t=1}^{T} \ell_{t,I_t} - \min_{i \in [K]} \sum_{t=1}^{T} \ell_{t,i}.$$

We say that the environment is oblivious if it selects the sequence of loss vectors irrespective of the past actions taken by the learner, and adaptive (or non-oblivious) if it is allowed to choose $\ell_t$ as a function of the past actions $I_{t-1}, \dots, I_1$. An equivalent formulation of the multi-armed bandit game uses the concept of rewards (also called gains or payoffs) instead of losses: in this version, the adversary chooses the sequence of reward functions $(r_t)$, with $r_{t,i}$ denoting the reward given to the learner for choosing action $i$ in round $t$. In this game, the learner aims at maximizing its total rewards. We will refer to the above two formulations as the loss game and the reward game, respectively.

Our goal in this paper is to construct algorithms for the learner that guarantee that the regret grows sublinearly. Since it is well known that no deterministic learning algorithm can achieve this goal, we are interested in randomized algorithms. Accordingly, the regret $R_T$ then becomes a random variable that we need to bound in some probabilistic sense. Most of the existing literature on non-stochastic bandits is concerned with bounding the pseudo-regret (or weak regret) defined as

$$\widehat{R}_T = \max_{i \in [K]} \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,I_t} - \sum_{t=1}^{T} \ell_{t,i}\right],$$

where the expectation integrates over the randomness injected by the learner. Proving bounds on the actual regret that hold with high probability is considered to be a significantly harder task that can only be achieved through serious changes to the learning algorithms and much more complicated analyses. One particularly common belief is that in order to obtain high-confidence performance guarantees, the learner cannot avoid repeatedly sampling arms from a uniform distribution, typically $\Omega(\sqrt{T})$ times (see, e.g., [5]). It is easy to see that such explicit exploration can impact the empirical performance of learning algorithms in a very negative way if there are many arms with high losses: even if the base learning algorithm quickly learns to focus on good arms, explicit exploration still forces the regret to grow at a steady rate. As a result, algorithms with high-probability performance guarantees tend to perform poorly even in very simple problems.

In the current paper, we propose an algorithm that guarantees strong regret bounds that hold with high probability without the explicit exploration component. One component that we preserve from the classical recipe for such algorithms is the biased estimation of losses, although our bias is of a much more delicate nature, and arguably more elegant than previous approaches. In particular, we adopt the implicit exploration (IX) strategy first proposed by Kocák, Neu, Valko, and Munos for the problem of online learning with side-observations. As we show in the current paper, this simple loss-estimation strategy allows proving high-probability bounds for a range of non-stochastic bandit problems including bandits with expert advice, tracking the best arm and bandits with side-observations. Our proofs are arguably cleaner and less involved than previous ones, and very elementary in the sense that they do not rely on advanced results from probability theory like Freedman's inequality [12]. The resulting bounds are tighter than all previously known bounds and hold simultaneously for all confidence levels, unlike most previously known bounds. For the first time in the literature, we also provide high-probability bounds for anytime algorithms that do not require prior knowledge of the time horizon $T$. A minor conceptual improvement in our analysis is a direct treatment of the loss game, as opposed to previous analyses that focused on the reward game, making our treatment more coherent with other state-of-the-art results in the online learning literature.¹

The rest of the paper is organized as follows. In Section 2, we review the known techniques for proving high-probability regret bounds for non-stochastic bandits and describe our implicit exploration strategy in precise terms. Section 3 states our main result concerning the concentration of the IX loss estimates and shows applications of this result to several problem settings. Finally, we conduct a set of simple experiments to illustrate the benefits of implicit exploration over previous techniques in Section 4.

2 Explicit and implicit exploration 

Most principled learning algorithms for the non-stochastic bandit problem are constructed by using a standard online learning algorithm such as the exponentially weighted forecaster ([26, 20, 13]) or follow the perturbed leader ([14, 18]) as a black box, with the true (unobserved) losses replaced by some appropriate estimates. One of the key challenges is constructing reliable estimates of the losses $\ell_{t,i}$ for all $i \in [K]$ based on the single observation $\ell_{t,I_t}$. Following Auer et al. [5], this is traditionally achieved by using importance-weighted loss/reward estimates of the form

$$\hat{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i}} \mathbb{I}_{\{I_t = i\}} \qquad \text{or} \qquad \hat{r}_{t,i} = \frac{r_{t,i}}{p_{t,i}} \mathbb{I}_{\{I_t = i\}}, \tag{1}$$

where $p_{t,i} = \mathbb{P}[I_t = i \mid \mathcal{F}_{t-1}]$ is the probability that the learner picks action $i$ in round $t$, conditioned on the observation history $\mathcal{F}_{t-1}$ of the learner up to the beginning of round $t$. It is easy to show that these estimates are unbiased for all $i$ with $p_{t,i} > 0$ in the sense that $\mathbb{E}\,\hat{\ell}_{t,i} = \ell_{t,i}$ for all such $i$.

¹ In fact, studying the loss game is colloquially known to allow better constant factors in the bounds in many settings (see, e.g., Bubeck and Cesa-Bianchi [9]). Our result further reinforces these observations.
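The unbiasedness claim can be checked by averaging the estimate over every possible draw of the arm. The following is a minimal sketch, assuming a single round with a hypothetical loss vector and sampling distribution; the function names and example values are illustrative and do not come from the paper.

```python
def importance_weighted_estimates(observed_loss, probs, chosen):
    """Estimates of Equation (1): observed_loss / p for the chosen arm, 0 elsewhere."""
    return [observed_loss / p if i == chosen else 0.0 for i, p in enumerate(probs)]


def expected_estimates(losses, probs):
    """Exact expectation of the estimates over the draw I_t ~ probs."""
    expectation = [0.0] * len(probs)
    for chosen, p_chosen in enumerate(probs):
        estimates = importance_weighted_estimates(losses[chosen], probs, chosen)
        expectation = [e + p_chosen * est for e, est in zip(expectation, estimates)]
    return expectation


# Hypothetical one-round example (values chosen only for illustration).
example_losses = [0.2, 0.9, 0.5]
example_probs = [0.5, 0.3, 0.2]
example_expectation = expected_estimates(example_losses, example_probs)
```

Up to floating-point error, `example_expectation` coincides with `example_losses`, while a single realized estimate can be as large as the observed loss divided by the smallest sampling probability.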


For concreteness, consider the Exp3 algorithm of Auer et al. [5] as described in Bubeck and Cesa-Bianchi [9, Section 3]. In every round $t$, this algorithm uses the loss estimates defined in Equation (1) to compute the weights $w_{t,i} = \exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_{s,i}\right)$ for all $i$ and some positive parameter $\eta$ that is often called the learning rate. Having computed these weights, Exp3 draws arm $I_t = i$ with probability proportional to $w_{t,i}$. Relying on the unbiasedness of the estimates (1) and an optimized setting of $\eta$, one can prove that Exp3 enjoys a pseudo-regret bound of $\sqrt{2TK \log K}$. However, the fluctuations of the loss estimates around the true losses are too large to permit bounding the true regret with high probability. To keep these fluctuations under control, Auer et al. [5] propose to use the biased reward estimates

$$\tilde{r}_{t,i} = \hat{r}_{t,i} + \frac{\beta}{p_{t,i}} \tag{2}$$

with an appropriately chosen $\beta > 0$. Given these estimates, the Exp3.P algorithm of Auer et al. [5] computes the weights $w_{t,i} = \exp\left(\eta \sum_{s=1}^{t-1} \tilde{r}_{s,i}\right)$ for all arms $i$ and then samples $I_t$ according to the distribution

$$p_{t,i} = (1 - \gamma) \frac{w_{t,i}}{\sum_{j=1}^{K} w_{t,j}} + \frac{\gamma}{K},$$

where $\gamma \in [0,1]$ is the exploration parameter. The argument for this explicit exploration is that it helps to keep the range (and thus the variance) of the above reward estimates bounded, thus enabling the use of (more or less) standard concentration results.² In particular, the key element in the analysis of Exp3.P is showing that the inequality

$$\sum_{t=1}^{T} \left(r_{t,i} - \tilde{r}_{t,i}\right) \le \frac{\log(K/\delta)}{\beta}$$

holds simultaneously for all $i$ with probability at least $1 - \delta$. In other words, this shows that the cumulative estimates $\sum_{t=1}^{T} \tilde{r}_{t,i}$ are upper confidence bounds for the true rewards $\sum_{t=1}^{T} r_{t,i}$.

In the current paper, we propose to use the loss estimates defined as

$$\tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma_t} \mathbb{I}_{\{I_t = i\}} \tag{3}$$

for all $i$ and an appropriately chosen $\gamma_t > 0$, and then use the resulting estimates in an exponential-weights algorithm scheme without any explicit exploration. Loss estimates of this form were first used by Kocák et al.; following them, we refer to this technique as Implicit eXploration, or, in short, IX. In what follows, we argue that IX as defined above achieves a similar variance-reducing effect as the one achieved by the combination of explicit exploration and the biased reward estimates of Equation (2). In particular, we show that the IX estimates (3) constitute a lower confidence bound for the true losses, which allows proving high-probability bounds for a number of variants of the multi-armed bandit problem.
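To make the effect of the extra term in the denominator concrete, the sketch below computes the IX estimate for one round and its exact expectation over the draw of the arm. It is only an illustration under assumed values: the function names, the loss vector, the distribution and the value of the IX parameter are hypothetical.

```python
def ix_estimates(observed_loss, probs, chosen, gamma):
    """IX estimates: observed_loss / (p + gamma) for the chosen arm, 0 elsewhere."""
    return [observed_loss / (p + gamma) if i == chosen else 0.0
            for i, p in enumerate(probs)]


def expected_ix_estimates(losses, probs, gamma):
    """Exact expectation of the IX estimates over the draw I_t ~ probs."""
    expectation = [0.0] * len(probs)
    for chosen, p_chosen in enumerate(probs):
        estimates = ix_estimates(losses[chosen], probs, chosen, gamma)
        expectation = [e + p_chosen * est for e, est in zip(expectation, estimates)]
    return expectation


# Hypothetical round in which two arms are almost never sampled.
ix_losses = [0.2, 0.9, 0.5]
ix_probs = [0.98, 0.01, 0.01]
ix_gamma = 0.05
ix_expectation = expected_ix_estimates(ix_losses, ix_probs, ix_gamma)
```

Each entry of `ix_expectation` equals $p_{t,i}\ell_{t,i}/(p_{t,i}+\gamma)$, so the estimates are biased downwards, and no single estimate can exceed $1/\gamma$ regardless of how small the sampling probability is.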


3 High-probability regret bounds via implicit exploration 

In this section, we present a concentration result concerning the IX loss estimates of Equation (3), and apply this result to prove high-probability performance guarantees for a number of non-stochastic bandit problems. The following lemma states our concentration result in its most general form:

² Explicit exploration is believed to be inevitable for proving bounds in the reward game for various other reasons, too; see Bubeck and Cesa-Bianchi [9] for a discussion.
Lemma 1. Let $(\gamma_t)$ be a fixed non-increasing sequence with $\gamma_t > 0$ and let $\alpha_{t,i}$ be nonnegative $\mathcal{F}_{t-1}$-measurable random variables satisfying $\alpha_{t,i} \le 2\gamma_t$ for all $t$ and $i$. Then, with probability at least $1 - \delta$,

$$\sum_{t=1}^{T} \sum_{i=1}^{K} \alpha_{t,i} \left(\tilde{\ell}_{t,i} - \ell_{t,i}\right) \le \log(1/\delta).$$

A particularly important special case of the above lemma is the following:

Corollary 1. Let $\gamma_t = \gamma > 0$ for all $t$. With probability at least $1 - \delta$,

$$\sum_{t=1}^{T} \left(\tilde{\ell}_{t,i} - \ell_{t,i}\right) \le \frac{\log(K/\delta)}{2\gamma}$$

simultaneously holds for all $i \in [K]$.

This corollary follows from applying Lemma 1 to the functions $\alpha_{t,i} = 2\gamma \mathbb{I}_{\{i = j\}}$ for all $j$ and applying the union bound. The full proof of Lemma 1 is presented in the Appendix. For didactic purposes, we now present a direct proof for Corollary 1, which is essentially a simpler version of the proof of Lemma 1.


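Proof of Corollary 1. For convenience, we will use the notation $\beta = 2\gamma$. First, observe that

$$\tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}_{\{I_t = i\}} \le \frac{\ell_{t,i}}{p_{t,i} + \gamma \ell_{t,i}} \mathbb{I}_{\{I_t = i\}} = \frac{1}{2\gamma} \cdot \frac{2\gamma \ell_{t,i}/p_{t,i}}{1 + \gamma \ell_{t,i}/p_{t,i}} \mathbb{I}_{\{I_t = i\}} \le \frac{1}{\beta} \log\left(1 + \beta \hat{\ell}_{t,i}\right),$$

where the first step follows from $\ell_{t,i} \in [0,1]$ and the last one from the elementary inequality $\frac{z}{1 + z/2} \le \log(1 + z)$ that holds for all $z \ge 0$. Using the above inequality, we get that

$$\mathbb{E}\left[\exp\left(\beta \tilde{\ell}_{t,i}\right) \,\middle|\, \mathcal{F}_{t-1}\right] \le \mathbb{E}\left[1 + \beta \hat{\ell}_{t,i} \,\middle|\, \mathcal{F}_{t-1}\right] \le 1 + \beta \ell_{t,i} \le \exp\left(\beta \ell_{t,i}\right),$$

where the second and third steps are obtained by using $\mathbb{E}\left[\hat{\ell}_{t,i} \mid \mathcal{F}_{t-1}\right] \le \ell_{t,i}$, which holds by definition of $\hat{\ell}_{t,i}$, and the inequality $1 + z \le e^z$ that holds for all $z \in \mathbb{R}$. As a result, the process $Z_t = \exp\left(\beta \sum_{s=1}^{t} \left(\tilde{\ell}_{s,i} - \ell_{s,i}\right)\right)$ is a supermartingale with respect to $(\mathcal{F}_t)$: $\mathbb{E}[Z_t \mid \mathcal{F}_{t-1}] \le Z_{t-1}$. Observe that, since $Z_0 = 1$, this implies $\mathbb{E}[Z_T] \le \mathbb{E}[Z_{T-1}] \le \dots \le 1$, and thus by Markov's inequality,

$$\mathbb{P}\left[\sum_{t=1}^{T} \left(\tilde{\ell}_{t,i} - \ell_{t,i}\right) > \varepsilon\right] \le \mathbb{E}\left[\exp\left(\beta \sum_{t=1}^{T} \left(\tilde{\ell}_{t,i} - \ell_{t,i}\right)\right)\right] \cdot \exp(-\beta\varepsilon) \le \exp(-\beta\varepsilon)$$

holds for any $\varepsilon > 0$. The statement of the corollary follows from solving $\exp(-\beta\varepsilon) = \delta/K$ for $\varepsilon$ and using the union bound over all arms $i$. □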
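The one-step inequality at the heart of this proof can be checked numerically. The sketch below is an illustration, not part of the original analysis: for a single arm with sampling probability `p`, true loss `loss` and IX parameter `gamma`, it evaluates the conditional expectation of the exponentiated IX estimate exactly and compares it with the exponentiated true loss over a grid. The function names and the grid resolution are assumptions made for this example.

```python
import math


def conditional_mgf(p, loss, gamma):
    """E[exp(beta * IX estimate) | history] for one arm, with beta = 2 * gamma.

    The arm is drawn with probability p (estimate loss / (p + gamma))
    and missed with probability 1 - p (estimate 0).
    """
    beta = 2.0 * gamma
    return p * math.exp(beta * loss / (p + gamma)) + (1.0 - p)


def worst_ratio(gamma, grid=100):
    """Largest value of E[exp(beta * estimate)] / exp(beta * loss) over a grid."""
    beta = 2.0 * gamma
    worst = 0.0
    for a in range(1, grid + 1):
        p = a / grid
        for b in range(grid + 1):
            loss = b / grid
            ratio = conditional_mgf(p, loss, gamma) / math.exp(beta * loss)
            worst = max(worst, ratio)
    return worst
```

A ratio of at most one for every grid point is exactly the supermartingale step: each round multiplies the process by a factor whose conditional expectation does not exceed one.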


In what follows, we put Lemma 1 to use and prove improved high-probability performance guarantees for several well-studied variants of the non-stochastic bandit problem, namely, the multi-armed bandit problem with expert advice, tracking the best arm for multi-armed bandits, and bandits with side-observations. The general form of Lemma 1 will allow us to prove high-probability bounds for anytime algorithms that can operate without prior knowledge of $T$. For clarity, we will only provide such bounds for the standard multi-armed bandit setting; extending the derivations to other settings is left as an easy exercise. For all algorithms, we prove bounds that scale linearly with $\log(1/\delta)$ and hold simultaneously for all levels $\delta$. Note that this dependence can be improved to $\sqrt{\log(1/\delta)}$ for a fixed confidence level $\delta$, if the algorithm can use this $\delta$ to tune its parameters. This is the way that Table 1 presents our new bounds side-by-side with the best previously known ones.
Setting                          Best known regret bound                  Our new regret bound
Multi-armed bandits              $5.15\sqrt{TK\log(K/\delta)}$            $2\sqrt{2TK\log(K/\delta)}$
Bandits with expert advice       $6\sqrt{TK\log(N/\delta)}$               $2\sqrt{2TK\log(N/\delta)}$
Tracking the best arm            $7\sqrt{KTS\log(KT/\delta S)}$           $2\sqrt{2KTS\log(KT/\delta S)}$
Bandits with side-observations   $\widetilde{O}(\sqrt{mT})$               $\widetilde{O}(\sqrt{\alpha T})$

Table 1: Our results compared to the best previously known results in the four settings considered in Sections 3.1 to 3.4. See the respective sections for references and notation.


3.1 Multi-armed bandits

In this section, we propose a variant of the Exp3 algorithm of Auer et al. [5] that uses the IX loss estimates (3): EXP3-IX. The algorithm in its most general form uses two nonincreasing sequences of nonnegative parameters: $(\eta_t)$ and $(\gamma_t)$. In every round, EXP3-IX chooses action $I_t = i$ with probability proportional to

$$p_{t,i} \propto w_{t,i} = \exp\left(-\eta_t \sum_{s=1}^{t-1} \tilde{\ell}_{s,i}\right), \tag{4}$$

without mixing any explicit exploration term into the distribution. A fixed-parameter version of EXP3-IX is presented as Algorithm 1.

Algorithm 1 EXP3-IX

Parameters: $\eta > 0$, $\gamma > 0$.
Initialization: $w_{1,i} = 1$.
for $t = 1, 2, \dots, T$, repeat
1. $p_{t,i} = \frac{w_{t,i}}{\sum_{j=1}^{K} w_{t,j}}$.
2. Draw $I_t \sim p_t = (p_{t,1}, \dots, p_{t,K})$.
3. Observe loss $\ell_{t,I_t}$.
4. $\tilde{\ell}_{t,i} \leftarrow \frac{\ell_{t,i}}{p_{t,i} + \gamma} \mathbb{I}_{\{I_t = i\}}$ for all $i \in [K]$.
5. $w_{t+1,i} \leftarrow w_{t,i} e^{-\eta \tilde{\ell}_{t,i}}$ for all $i \in [K]$.

Our theorem below states a high-probability bound on the regret of EXP3-IX. Notably, our bound exhibits the best known constant factor of $2\sqrt{2}$ in the leading term, improving on the factor of 5.15 due to Bubeck and Cesa-Bianchi [9]. The best known leading constant for the pseudo-regret bound of Exp3 is $\sqrt{2}$, also proved in Bubeck and Cesa-Bianchi [9].

Theorem 1. Fix an arbitrary $\delta > 0$. With $\eta_t = 2\gamma_t = \sqrt{\frac{2 \log K}{KT}}$ for all $t$, EXP3-IX guarantees

$$R_T \le 2\sqrt{2KT \log K} + \left(\sqrt{\frac{2KT}{\log K}} + 1\right) \log(2/\delta)$$

with probability at least $1 - \delta$. Furthermore, setting $\eta_t = 2\gamma_t = \sqrt{\frac{\log K}{Kt}}$ for all $t$, the bound becomes

$$R_T \le 4\sqrt{KT \log K} + \left(2\sqrt{\frac{KT}{\log K}} + 1\right) \log(2/\delta).$$


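Proof. Let us fix an arbitrary $\delta' \in (0,1)$. Following the standard analysis of Exp3 in the loss game and nonincreasing learning rates, we can obtain the bound

$$\sum_{t=1}^{T} \left(\sum_{i=1}^{K} p_{t,i} \tilde{\ell}_{t,i} - \tilde{\ell}_{t,j}\right) \le \frac{\log K}{\eta_T} + \sum_{t=1}^{T} \frac{\eta_t}{2} \sum_{i=1}^{K} p_{t,i} \left(\tilde{\ell}_{t,i}\right)^2$$

for any $j$. Now observe that

$$\sum_{i=1}^{K} p_{t,i} \tilde{\ell}_{t,i} = \sum_{i=1}^{K} \mathbb{I}_{\{I_t = i\}} \frac{\ell_{t,i}\left(p_{t,i} + \gamma_t\right)}{p_{t,i} + \gamma_t} - \gamma_t \sum_{i=1}^{K} \mathbb{I}_{\{I_t = i\}} \frac{\ell_{t,i}}{p_{t,i} + \gamma_t} = \ell_{t,I_t} - \gamma_t \sum_{i=1}^{K} \tilde{\ell}_{t,i}. \tag{5}$$

Similarly, $\sum_{i=1}^{K} p_{t,i} \tilde{\ell}_{t,i}^2 \le \sum_{i=1}^{K} \tilde{\ell}_{t,i}$ holds by the boundedness of the losses. Thus, we get that

$$\sum_{t=1}^{T} \left(\ell_{t,I_t} - \ell_{t,j}\right) \le \sum_{t=1}^{T} \left(\tilde{\ell}_{t,j} - \ell_{t,j}\right) + \frac{\log K}{\eta_T} + \sum_{t=1}^{T} \left(\frac{\eta_t}{2} + \gamma_t\right) \sum_{i=1}^{K} \tilde{\ell}_{t,i}$$

$$\le \frac{\log(K/\delta')}{2\gamma_T} + \frac{\log K}{\eta_T} + \sum_{t=1}^{T} \left(\frac{\eta_t}{2} + \gamma_t\right) \sum_{i=1}^{K} \ell_{t,i} + \log(1/\delta')$$

holds with probability at least $1 - 2\delta'$, where the last line follows from an application of Lemma 1 with $\alpha_{t,i} = \eta_t/2 + \gamma_t$ for all $t, i$ and taking the union bound. By taking $j = \arg\min_i L_{T,i}$, where $L_{T,i} = \sum_{t=1}^{T} \ell_{t,i}$, and $\delta' = \delta/2$, and using the boundedness of the losses, we obtain

$$R_T \le \frac{\log(2K/\delta)}{2\gamma_T} + \frac{\log K}{\eta_T} + K \sum_{t=1}^{T} \left(\frac{\eta_t}{2} + \gamma_t\right) + \log(2/\delta).$$

The statements of the theorem then follow immediately, noting that $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$. □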
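The last step of the proof is a matter of plugging the two parameter schedules into the final inequality. As a worked check, the sketch below evaluates that inequality numerically, reading it as $\log(2K/\delta)/(2\gamma_T) + \log K/\eta_T + K\sum_t(\eta_t/2 + \gamma_t) + \log(2/\delta)$; this reading, the function names and the example values of $K$, $T$ and $\delta$ are assumptions of this illustration.

```python
import math


def proof_bound(num_arms, delta, etas, gammas):
    """Right-hand side of the final inequality in the proof of Theorem 1."""
    return (math.log(2 * num_arms / delta) / (2 * gammas[-1])
            + math.log(num_arms) / etas[-1]
            + num_arms * sum(e / 2 + g for e, g in zip(etas, gammas))
            + math.log(2 / delta))


def fixed_tuning(num_arms, horizon):
    """Constant schedule eta_t = 2 * gamma_t = sqrt(2 log K / (K T))."""
    eta = math.sqrt(2 * math.log(num_arms) / (num_arms * horizon))
    return [eta] * horizon, [eta / 2] * horizon


def anytime_tuning(num_arms, horizon):
    """Decreasing schedule eta_t = 2 * gamma_t = sqrt(log K / (K t))."""
    etas = [math.sqrt(math.log(num_arms) / (num_arms * t))
            for t in range(1, horizon + 1)]
    return etas, [e / 2 for e in etas]


# Hypothetical problem size, used only to exercise the functions.
K, T, DELTA = 10, 1000, 0.05
fixed_value = proof_bound(K, DELTA, *fixed_tuning(K, T))
anytime_value = proof_bound(K, DELTA, *anytime_tuning(K, T))
```

With the constant schedule, the two terms that do not involve $\delta$ each equal $\sqrt{2KT\log K}$, which is where the leading constant $2\sqrt{2}$ comes from; the decreasing schedule needs no knowledge of $T$ and gives a leading term of at most $4\sqrt{KT\log K}$.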


3.2 Bandits with expert advice 


We now turn to the setting of multi-armed bandits with expert advice, as defined in Auer et al. [5], and later revisited by McMahan and Streeter [22] and Beygelzimer et al. [7]. In this setting, we assume that in every round $t = 1, 2, \dots, T$, the learner observes a set of $N$ probability distributions $\xi_t(1), \xi_t(2), \dots, \xi_t(N) \in [0,1]^K$ over the $K$ arms, such that $\sum_{i=1}^{K} \xi_{t,i}(n) = 1$ for all $n \in [N]$. We assume that the sequences $(\xi_t(n))$ are measurable with respect to $(\mathcal{F}_t)$. The $n$-th of these vectors represents the probabilistic advice of the corresponding $n$-th expert. The goal of the learner in this setting is to pick a sequence of arms so as to minimize the regret against the best expert:

$$R_T^{\xi} = \sum_{t=1}^{T} \ell_{t,I_t} - \min_{n \in [N]} \sum_{t=1}^{T} \sum_{i=1}^{K} \xi_{t,i}(n) \ell_{t,i}.$$

To tackle this problem, we propose a modification of the Exp4 algorithm of Auer et al. [5] that uses the IX loss estimates (3), and also drops the explicit exploration component of the original algorithm. Specifically, EXP4-IX uses the loss estimates defined in Equation (3) to compute the weights

$$w_{t,n} = \exp\left(-\eta \sum_{s=1}^{t-1} \sum_{i=1}^{K} \xi_{s,i}(n) \tilde{\ell}_{s,i}\right)$$

for every expert $n \in [N]$, and then draws arm $i$ with probability $p_{t,i} \propto \sum_{n=1}^{N} w_{t,n} \xi_{t,i}(n)$. We now state the performance guarantee of EXP4-IX. Our bound improves the best known leading constant of 6 due to Beygelzimer et al. [7] to $2\sqrt{2}$ and is a factor of 2 worse than the best known constant in the pseudo-regret bound for Exp4. The proof of the theorem is presented in the Appendix.

Theorem 2. Fix an arbitrary $\delta > 0$ and set $\eta = 2\gamma = \sqrt{\frac{2 \log N}{KT}}$ for all $t$. Then, with probability at least $1 - \delta$, the regret of EXP4-IX satisfies

$$R_T^{\xi} \le 2\sqrt{2KT \log N} + \left(\sqrt{\frac{2KT}{\log N}} + 1\right) \log(2/\delta).$$
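A single round of EXP4-IX can be sketched as two small functions: one that mixes the expert advice into a distribution over arms, and one that charges each expert with its share of the IX estimate. This is a minimal illustration; the function names and the data layout (`advice[n][i]` holding the probability that expert `n` assigns to arm `i`) are assumptions of the example.

```python
import math


def exp4_ix_probs(cumulative_expert_losses, advice, eta):
    """Arm distribution: p_i proportional to sum_n w_n * advice[n][i]."""
    base = min(cumulative_expert_losses)
    weights = [math.exp(-eta * (c - base)) for c in cumulative_expert_losses]
    num_arms = len(advice[0])
    mixture = [sum(w * advice[n][i] for n, w in enumerate(weights))
               for i in range(num_arms)]
    total = sum(mixture)
    return [m / total for m in mixture]


def exp4_ix_update(cumulative_expert_losses, advice, probs, arm, observed_loss, gamma):
    """Add each expert's estimated loss for this round.

    The IX estimate is nonzero only for the pulled arm, so expert n is
    charged advice[n][arm] * observed_loss / (probs[arm] + gamma).
    """
    estimate = observed_loss / (probs[arm] + gamma)
    return [c + advice[n][arm] * estimate
            for n, c in enumerate(cumulative_expert_losses)]
```

The learner alternates the two calls: compute `probs`, draw an arm from it, observe the loss of that arm, then update the cumulative expert losses.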


3.3 Tracking the best sequence of arms 


In this section, we consider the problem of competing with sequences of actions. Similarly to Herbster and Warmuth, we consider the class of sequences that switch at most $S$ times between actions. We measure the performance of the learner in this setting in terms of the regret against the best sequence from this class $\mathcal{C}(S) \subseteq [K]^T$, defined as

$$R_T^S = \sum_{t=1}^{T} \ell_{t,I_t} - \min_{(J_t) \in \mathcal{C}(S)} \sum_{t=1}^{T} \ell_{t,J_t}.$$

Similarly to Auer et al. [5], we now propose to adapt the Fixed Share algorithm of Herbster and Warmuth to our setting. Our algorithm, called EXP3-SIX, updates a set of weights $w_{t,\cdot}$ over the arms in a recursive fashion. In the first round, EXP3-SIX sets $w_{1,i} = 1/K$ for all $i$. In the following rounds, the weights are updated for every arm $i$ as

$$w_{t+1,i} = (1 - \alpha) w_{t,i} e^{-\eta \tilde{\ell}_{t,i}} + \frac{\alpha}{K} \sum_{j=1}^{K} w_{t,j} e^{-\eta \tilde{\ell}_{t,j}}.$$

In round $t$, the algorithm draws arm $I_t = i$ with probability $p_{t,i} \propto w_{t,i}$. Below, we give the performance guarantees of EXP3-SIX. Note that our leading factor of $2\sqrt{2}$ again improves over the best previously known leading factor of 7, shown by Audibert and Bubeck [4]. The proof of the theorem is given in the Appendix.
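The weight recursion of EXP3-SIX can be sketched in a few lines, assuming the standard Fixed Share form of the update: every weight is first scaled by the exponentiated IX estimate, and a fraction `alpha` of the total scaled mass is then redistributed uniformly. The function name and argument names are hypothetical.

```python
import math


def exp3_six_update(weights, ix_estimates, eta, alpha):
    """One Fixed Share step applied to IX loss estimates.

    weights[i] is the current weight of arm i and ix_estimates[i] its IX
    estimate for the round (zero for every arm that was not pulled).
    """
    num_arms = len(weights)
    scaled = [w * math.exp(-eta * est) for w, est in zip(weights, ix_estimates)]
    shared = alpha / num_arms * sum(scaled)
    return [(1.0 - alpha) * s + shared for s in scaled]
```

The sharing step leaves the total mass of the scaled weights unchanged and guarantees that each arm keeps at least a fraction `alpha / K` of it, which is what lets the algorithm recover quickly when the best arm changes.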

Theorem 3. Fix an arbitrary 8 > 0 and set 77 = and a = T ‘^ | , where S = S + 1. 

Then, with probability at least 1 — 5, the regret of EXP3-SIX satisfies 

Rf < + (J]gF + l) log (2ffl . 


3.4 Bandits with side-observations 


Let us now turn to the problem of online learning in bandit problems in the presence of side observations, as defined by Mannor and Shamir [21] and later elaborated by Alon et al. In this setting, the learner and the environment interact exactly as in the multi-armed bandit problem, the main difference being that in every round, the learner observes the losses of some arms other than its actually chosen arm $I_t$. The structure of the side observations is described by the directed graph $G$: nodes of $G$ correspond to individual arms, and the presence of arc $i \to j$ implies that the learner will observe $\ell_{t,j}$ upon selecting $I_t = i$.

Implicit exploration and EXP3-IX were first proposed by Kocák et al. for this precise setting. To describe this variant, let us introduce the notations $O_{t,i} = \mathbb{I}_{\{I_t = i\}} + \mathbb{I}_{\{(I_t \to i) \in G\}}$ and $o_{t,i} = \mathbb{E}[O_{t,i} \mid \mathcal{F}_{t-1}]$. Then, the IX loss estimates in this setting are defined for all $t, i$ as $\tilde{\ell}_{t,i} = \frac{O_{t,i} \ell_{t,i}}{o_{t,i} + \gamma_t}$. With these estimates at hand, EXP3-IX draws arm $I_t$ from the exponentially weighted distribution defined in Equation (4). The following theorem provides the regret bound concerning this algorithm.
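The quantities in this paragraph are easy to compute when the graph is stored as adjacency lists. The sketch below assumes that `out_neighbors[j]` lists the arms whose losses are revealed when arm `j` is played and that the graph has no self-loops; these conventions and the function names are assumptions of this illustration.

```python
def observation_probs(probs, out_neighbors):
    """Probability of observing each arm: own probability plus in-neighbor mass."""
    observe = list(probs)
    for j, neighbors in enumerate(out_neighbors):
        for i in neighbors:
            if i != j:
                observe[i] += probs[j]
    return observe


def ix_side_estimates(losses, probs, out_neighbors, chosen, gamma):
    """IX estimates with side observations: loss / (o + gamma) for observed arms."""
    observe = observation_probs(probs, out_neighbors)
    observed_arms = {chosen} | set(out_neighbors[chosen])
    return [losses[i] / (observe[i] + gamma) if i in observed_arms else 0.0
            for i in range(len(probs))]
```

With an empty graph the estimates reduce to the bandit estimates of Equation (3), while with a complete graph every arm is observed with probability one and the estimates are the true losses shrunk by a factor of $1/(1+\gamma)$.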


Theorem 4. Fix an arbitrary $\delta > 0$. Assume that $T \ge K^2/(8\alpha)$ and set $\eta = 2\gamma = \sqrt{\frac{\log K}{2\alpha T \log(KT)}}$, where $\alpha$ is the independence number of $G$. With probability at least $1 - \delta$, EXP3-IX guarantees


log K 


Rt < ( 4+2 Vlog (4/5)) ^2aT{\og 2 K+\ogKT) L °g ■ 


The proof of the theorem is given in the Appendix. While the proof of this statement is significantly more involved than the other proofs presented in this paper, it provides a fundamentally new result. In particular, our bound is in terms of the independence number $\alpha$ and thus matches the minimax regret bound proved by Alon et al. for this setting up to logarithmic factors. In contrast, the only high-probability regret bound for this setting due to Alon et al. scales with the size $m$ of the maximal acyclic subgraph of $G$, which can be much larger than $\alpha$ in general (i.e., $m$ may be $\omega(\alpha)$ for some graphs).

4 Empirical evaluation 

We conduct a simple experiment to demonstrate the robustness of EXP3-IX as compared to Exp3 and its superior performance as compared to Exp3.P. Our setting is a 10-arm bandit problem where all losses are independent draws of Bernoulli random variables. The mean losses of arms 1 through 8 are $1/2$ and the mean loss of arm 9 is $1/2 - \Delta$ for all rounds $t = 1, 2, \dots, T$. The mean losses of arm 10 are changing over time: for rounds $t \le T/2$, the mean is $1/2 + \Delta$, and $1/2 - 4\Delta$ afterwards. This choice ensures that up to at least round $T/2$, arm 9 is clearly better than other arms. In the second half of the game, arm 10 starts to outperform arm 9 and eventually becomes the leader.
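The loss sequence of this experiment is simple to generate. The sketch below is one possible reading of the setup: arms are indexed from 0, so arm 9 of the text is index 8 and arm 10 is index 9, and the change point is taken to be inclusive (rounds up to and including $T/2$ use the first regime), which is an assumption since the extracted text does not settle the boundary case. The function names are hypothetical.

```python
import random


def mean_losses(t, horizon, gap):
    """Mean losses of the 10 arms in round t (rounds are numbered from 1)."""
    means = [0.5] * 8          # arms 1 through 8
    means.append(0.5 - gap)    # arm 9 is better by a fixed gap throughout
    if t <= horizon / 2:       # assumed inclusive boundary
        means.append(0.5 + gap)
    else:
        means.append(0.5 - 4 * gap)
    return means


def sample_losses(t, horizon, gap, rng):
    """Independent Bernoulli losses with the means of round t."""
    return [1.0 if rng.random() < m else 0.0 for m in mean_losses(t, horizon, gap)]
```

For example, `sample_losses(t, 10**6, 0.1, random.Random(1))` draws one loss vector under the horizon and gap used in the evaluation below.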

We have evaluated the performance of Exp3, Exp3.P and EXP3-IX in the above setting with $T = 10^6$ and $\Delta = 0.1$. For fairness of comparison, we evaluate all three algorithms for a wide range of parameters. In particular, for all three algorithms, we set a base learning rate $\eta$ according to the best known theoretical results [9, Theorems 3.1 and 3.3] and varied the multiplier of the respective base parameters between 0.01 and 100. Other parameters are set as $\gamma = \eta/2$ for EXP3-IX and $\beta = \gamma/K = \eta$ for Exp3.P. We studied the regret up to two interesting rounds in the game: up to $T/2$, where the losses are i.i.d., and up to $T$, where the algorithms have to notice the shift in the loss distributions. Figure 1 shows the empirical means and standard deviations over 50 runs of the regrets of the three algorithms as a function of the multipliers. The results clearly show that EXP3-IX largely improves on the empirical performance of Exp3.P and is also much more robust in the non-stochastic regime than vanilla Exp3.

Figure 1: Regret of Exp3, Exp3.P, and EXP3-IX, respectively, in the problem described in Section 4.

5 Discussion 

In this paper, we have shown that, contrary to popular belief, explicit exploration is not necessary to achieve high-probability regret bounds for non-stochastic bandit problems. Interestingly, however, we have observed in several of our experiments that our IX-based algorithms still draw every arm roughly $\sqrt{T}$ times, even though this is not explicitly enforced by the algorithm. This suggests a need for a more complete study of the role of exploration, to find out whether pulling every single arm $\Omega(\sqrt{T})$ times is necessary for achieving near-optimal guarantees.

One can argue that tuning the IX parameter that we introduce may actually be just as difficult in practice as tuning the parameters of Exp3.P. However, every aspect of our analysis suggests that $\gamma_t = \eta_t/2$ is the most natural choice for these parameters, and thus this is the choice that we recommend. One limitation of our current analysis is that it only permits deterministic learning-rate and IX parameters (see the conditions of Lemma 1). That is, proving adaptive regret bounds in the vein of [15, 24, 23] that hold with high probability is still an open challenge.

Another interesting question for future study is whether the implicit exploration approach can help in advancing the state of the art in the more general setting of linear bandits. All known algorithms for this setting rely on explicit exploration techniques, and the strength of the obtained results depends crucially on the choice of the exploration distribution (see [8, 16] for recent advances). Interestingly, IX has a natural extension to the linear bandit problem. To see this, consider the vector $V_t = e_{I_t}$ and the matrix $P_t = \mathbb{E}\left[V_t V_t^{\top}\right]$. Then, the IX loss estimates can be written as $\tilde{\ell}_t = (P_t + \gamma I)^{-1} V_t V_t^{\top} \ell_t$. Whether or not this estimate is the right choice for linear bandits remains to be seen.

Finally, we note that our estimates (3) are certainly not the only ones that allow avoiding explicit exploration. In fact, the careful reader might deduce from the proof of Lemma 1 that the same concentration bound can be shown to hold for the alternative loss estimates $\ell_{t,i} \mathbb{I}_{\{I_t = i\}} / \left(p_{t,i} + \gamma \ell_{t,i}\right)$ and $\log\left(1 + 2\gamma \ell_{t,i} \mathbb{I}_{\{I_t = i\}} / p_{t,i}\right) / (2\gamma)$. Actually, a variant of the latter estimate was used previously for proving high-probability regret bounds in the reward game by Audibert and Bubeck [4]; however, their proof still relied on explicit exploration. It is not hard to verify that all the results we presented in this paper (except Theorem 4) can be shown to hold for the above two estimates, too.
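The three estimates are ordered pointwise, which is the reason the same concentration argument covers all of them. The sketch below evaluates them for the chosen arm and checks the ordering on a grid; the function names and the grid are assumptions of this illustration, and the chain it checks is the one appearing in the first display of the proof of Corollary 1.

```python
import math


def ix_estimate(loss, p, gamma):
    """IX estimate of Equation (3) for the chosen arm."""
    return loss / (p + gamma)


def ratio_estimate(loss, p, gamma):
    """First alternative: loss / (p + gamma * loss)."""
    return loss / (p + gamma * loss)


def log_estimate(loss, p, gamma):
    """Second alternative: log(1 + 2 * gamma * loss / p) / (2 * gamma)."""
    return math.log(1.0 + 2.0 * gamma * loss / p) / (2.0 * gamma)


def ordering_holds(gamma, grid=50, tol=1e-12):
    """Check ix <= ratio <= log on a grid of probabilities and losses."""
    for a in range(1, grid + 1):
        p = a / grid
        for b in range(grid + 1):
            loss = b / grid
            low, mid, high = (ix_estimate(loss, p, gamma),
                              ratio_estimate(loss, p, gamma),
                              log_estimate(loss, p, gamma))
            if not (low <= mid + tol and mid <= high + tol):
                return False
    return True
```

All three coincide with zero when the loss is zero, and the IX estimate is the most conservative of the three for every positive loss.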
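A The proof of Lemma 1

Fix any $t$. For convenience, we will use the notation $\beta_t = 2\gamma_t$. First, observe that for any $i$,

$$\tilde{\ell}_{t,i} = \frac{\ell_{t,i}}{p_{t,i} + \gamma_t} \mathbb{I}_{\{I_t = i\}} \le \frac{\ell_{t,i}}{p_{t,i} + \gamma_t \ell_{t,i}} \mathbb{I}_{\{I_t = i\}} = \frac{1}{2\gamma_t} \cdot \frac{2\gamma_t \ell_{t,i}/p_{t,i}}{1 + \gamma_t \ell_{t,i}/p_{t,i}} \mathbb{I}_{\{I_t = i\}} \le \frac{1}{\beta_t} \log\left(1 + \beta_t \hat{\ell}_{t,i}\right),$$

where the first step follows from $\ell_{t,i} \in [0,1]$ and the last one from the elementary inequality $\frac{z}{1 + z/2} \le \log(1 + z)$ that holds for all $z \ge 0$.

Define the notations $\tilde{\lambda}_t = \sum_{i=1}^{K} \alpha_{t,i} \tilde{\ell}_{t,i}$ and $\lambda_t = \sum_{i=1}^{K} \alpha_{t,i} \ell_{t,i}$. Using the above inequality, we get that

$$\mathbb{E}\left[\exp\left(\tilde{\lambda}_t\right) \,\middle|\, \mathcal{F}_{t-1}\right] \le \mathbb{E}\left[\exp\left(\sum_{i=1}^{K} \frac{\alpha_{t,i}}{\beta_t} \log\left(1 + \beta_t \hat{\ell}_{t,i}\right)\right) \,\middle|\, \mathcal{F}_{t-1}\right]$$

$$\le \mathbb{E}\left[\prod_{i=1}^{K} \left(1 + \alpha_{t,i} \hat{\ell}_{t,i}\right) \,\middle|\, \mathcal{F}_{t-1}\right] = \mathbb{E}\left[1 + \sum_{i=1}^{K} \alpha_{t,i} \hat{\ell}_{t,i} \,\middle|\, \mathcal{F}_{t-1}\right] \tag{6}$$

$$\le 1 + \sum_{i=1}^{K} \alpha_{t,i} \ell_{t,i} \le \exp\left(\sum_{i=1}^{K} \alpha_{t,i} \ell_{t,i}\right) = \exp\left(\lambda_t\right),$$

where the second line follows from noting that $\alpha_{t,i} \le \beta_t$, using the inequality $x \log(1 + y) \le \log(1 + xy)$ that holds for all $y > -1$ and $x \in [0,1]$, and the identity $\prod_{i=1}^{K} \left(1 + \alpha_{t,i} \hat{\ell}_{t,i}\right) = 1 + \sum_{i=1}^{K} \alpha_{t,i} \hat{\ell}_{t,i}$ that follows from the fact that $\hat{\ell}_{t,i} \cdot \hat{\ell}_{t,j} = 0$ holds whenever $i \ne j$. The last line is obtained by using $\mathbb{E}\left[\hat{\ell}_{t,i} \mid \mathcal{F}_{t-1}\right] \le \ell_{t,i}$, which holds by definition of $\hat{\ell}_{t,i}$, and the inequality $1 + z \le e^z$ that holds for all $z \in \mathbb{R}$.

As a result, the process $Z_t = \exp\left(\sum_{s=1}^{t} \left(\tilde{\lambda}_s - \lambda_s\right)\right)$ is a supermartingale with respect to $(\mathcal{F}_t)$: $\mathbb{E}[Z_t \mid \mathcal{F}_{t-1}] \le Z_{t-1}$. Observe that, since $Z_0 = 1$, this implies $\mathbb{E}[Z_T] \le \mathbb{E}[Z_{T-1}] \le \dots \le 1$, and thus by Markov's inequality,

$$\mathbb{P}\left[\sum_{t=1}^{T} \left(\tilde{\lambda}_t - \lambda_t\right) > \varepsilon\right] \le \mathbb{E}\left[\exp\left(\sum_{t=1}^{T} \left(\tilde{\lambda}_t - \lambda_t\right)\right)\right] \cdot \exp(-\varepsilon) \le \exp(-\varepsilon)$$

holds for any $\varepsilon > 0$. The statement of the lemma follows from solving $\exp(-\varepsilon) = \delta$ for $\varepsilon$. □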

B Further proofs 

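B.1 The proof of Theorem 2

Fix an arbitrary $\delta' \in (0,1)$. For ease of notation, let us define $\pi_t(n) = w_{t,n} / \left(\sum_{m=1}^{N} w_{t,m}\right)$, so that $p_{t,i} = \sum_{n=1}^{N} \pi_t(n) \xi_{t,i}(n)$. By standard arguments, we can obtain

$$\sum_{t=1}^{T} \sum_{i=1}^{K} \left(p_{t,i} - \xi_{t,i}(m)\right) \tilde{\ell}_{t,i} \le \frac{\log N}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{n=1}^{N} \pi_t(n) \left(\sum_{i=1}^{K} \xi_{t,i}(n) \tilde{\ell}_{t,i}\right)^2$$

for any fixed $m \in [N]$. The last term on the right-hand side can be bounded as

$$\sum_{n=1}^{N} \pi_t(n) \left(\sum_{i=1}^{K} \xi_{t,i}(n) \tilde{\ell}_{t,i}\right)^2 \le \sum_{n=1}^{N} \pi_t(n) \sum_{i=1}^{K} \xi_{t,i}(n) \left(\tilde{\ell}_{t,i}\right)^2 = \sum_{i=1}^{K} p_{t,i} \left(\tilde{\ell}_{t,i}\right)^2 \le \sum_{i=1}^{K} \tilde{\ell}_{t,i},$$

where the first step uses Jensen's inequality and the last uses $p_{t,i} \tilde{\ell}_{t,i} \le 1$. Now, we can apply Lemma 1 and the union bound to show that

$$\sum_{t=1}^{T} \sum_{i=1}^{K} \xi_{t,i}(m) \left(\tilde{\ell}_{t,i} - \ell_{t,i}\right) \le \frac{\log(N/\delta')}{2\gamma}$$

holds simultaneously for all experts with probability at least $1 - \delta'$, and in particular for the best expert, too. Putting this observation together with the above bound and Equation (5), we get that

$$\sum_{t=1}^{T} \ell_{t,I_t} - \sum_{t=1}^{T} \sum_{i=1}^{K} \xi_{t,i}(m) \ell_{t,i} \le \frac{\log N}{\eta} + \frac{\log(N/\delta')}{2\gamma} + \left(\frac{\eta}{2} + \gamma\right) \sum_{t=1}^{T} \sum_{i=1}^{K} \tilde{\ell}_{t,i}$$

$$\le \frac{\log N}{\eta} + \frac{\log(N/\delta')}{2\gamma} + \left(\frac{\eta}{2} + \gamma\right) \sum_{t=1}^{T} \sum_{i=1}^{K} \ell_{t,i} + \log(1/\delta')$$

holds with probability at least $1 - 2\delta'$, where the last line follows from Lemma 1 and the union bound. The proof is concluded by taking $\delta' = \delta/2$ and plugging in the choices of $\gamma$ and $\eta$. □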


B.2 The proof of Theorem [3] 


The proof of the theorem builds on the techniques of Cesa-Bianchi et al. and Auer et al. [5]. Let us fix an arbitrary $\delta' \in (0,1)$ and denote the best sequence from $\mathcal{C}(S)$ by $J^*_{1:T}$. Then, a straightforward modification of Theorem 2 of Cesa-Bianchi et al. yields the bound³


K 




2S\ogK 

V 


7=1 \i=l / 

To proceed, let us apply Lemma [T] to obtain that 


log (a s (l-a) T 5 ) + ^YY Pt S ( £ m) ' 


t=l 2=1 


Y (v* 


-1 




< log(\C(S)\/6) 
27 


simultaneously holds for all sequences J\ £ C(S). By standard arguments (see, e.g., the proof of 

_ Cj 

Theorem 22 in Audibert and Bubeck 0), one can show that |C(S')| < I\ s (^-) ■ Now, combining 
the above with Equation ([5]) and Pt,dt i — 12iLi ^t,i> we get that 


Y (hit ~ kj;) < 


< 


2 S' log K 1 

rj r\ 

25 log K 1 


log (a s (l-a) T - g ) + 


\og(T/(S5')) + 1 


T K 


27 

log (a s ( 1 - af~ S ) + lo g( r /( 2 ^)) + * 1 

+ ( 2 + 7 L i » + (, 2 +7 J 


t=i j=i 


77 


V 

K 


27 


holds with probability at least $1 - 2\delta'$, where the last line follows from Lemma 1 and the union bound. Then, after observing that the losses are bounded in $[0,1]$ and choosing $\delta' = \delta/2$, we get that


R S < (S+ 1 ) log K _ 1 
1 ~ 77 77 

+ ( 1 + 7 ) KT+ 

holds with probability at least 1 — 
showing that 


log („7i - + (s + 1)log y slog( W :) 

/T7 \ log(2 / 5 ) 

V 2 V 27 

S. The only remaining piece required for proving the theorem is 


- log (a s (l - af~ § ) < Slog (^ 0 , 

which follows from the proof of Corollary 1 in hd, and then substituting the choice of 77 and 7. □ 


³ Proving this bound requires replacing Hoeffding's inequality in their Lemma 1 by the inequality $e^{-z} \le 1 - z + z^2/2$ that holds for all $z \ge 0$.
B.3 The proof of Theorem 4

Before we dive into the proof, we note that Lemma [I] does not hold for the loss estimates used by 
this variant of EXP3-IX due to a subtle technical issue. Precisely, in this case uLh+h) * 

Erli (l + it.i'j prevents us from directly applying Lemma jlj However, Corollary 1 can still be 

proven exactly the same way as done in Section [3] The only effect of this change is that the term 
log(l/<5 7 ) is replaced by K\og(K/S'). 

Turning to the actual proof, let us fix an arbitrary S' £ (0,1) and introduce the notation 


K 


Pt,i 


Q* = E , 


By the standard Exp 3-analysis, we have 

ei (y — @t,j ] l 


Now observe that 


t =1 \i=l 


T K 


log K r? 
T) + 2 


T K 


EE^^m) ■ 


T K 


t= 1 2 = 1 


EI>,(L) = EE^ 

t—1 i= 1 Z ' L 

T K 

<EE 


t= 1 2=1 


^t,i 




■t 


t,i 


K\og(K/5 f ) 
2 7 


- y 


t =i 


Klog(K/S r ) 
27 ’ 


holds with probability at least $1 - \delta'$ by an application of Corollary 1 for all $i$ and taking a union bound. Furthermore, we have


K 


K 


K 


^ ^ ^ ^ Pt,i^t : i H - ^ ^ ( @t,i Ot,i T) 




2=1 

K 


2=1 

K 


— ^ ^ Pt,i^t,i H" ^ ^ Ot,i) 


O t ,i + 7 


2=1 2=1 

By the Hoeffding-Azuma inequality, we have 

T T K 

Ev«. <EE ft A< 


Ot,i + 7 


- 7<9t- 


£=1 


£=1 2=1 


Tl 0 g(l/^) 


with probability at least $1 - \delta'$. After putting the above inequalities together and applying Lemma 1, we obtain the bound


Rt < 


log AT k^AT/J 7 ) /?? 


r? 27 

t rr 


(IHS> 


V AHog(A:/(5 7 ) | _ /Tlogtl/^ 7 ) 
2 ' 27 


E E fas - 0t s') 


t-1 2 = 1 


Pt,i^t 7 i 

°t,i + 7 


that holds with probability at least $1 - 3\delta'$ by the union bound. To bound the last term on the right-hand side, observe that

K 0 


Xt — yy — Ot.z) 


2=1 


Ot,i + 7 
is a martingale-difference sequence for all $i \in [K]$ with $|X_t| \le K$ and conditional variance


a 2 t (X t ) =E 


<E 


=E 


K 


^ ' {ot,i Ot,i) 


Pt,i 


\i=l 
f K 


Pt,i 


<E 


(S 0t,t °M+7 

K K 

i=l 0=1 
K K 


Ot,i + 7 
2 

F t -i 


Ft-1 


(since E [O t)i | F t -\] = o M ) 


Pt,i Pt,j 


°t,i + 7 °t,o + 7 


Ft -r 


yz yz °t>« 


_ i=i j =i 


Pt,i Pt,j -r-’ 

-i- ' -i- ^t-1 

o t ,i + 7 Ot tj + 7 

K K 


(since 


S' °M + 7 o* j +7 tt ’ St + 7 

Thus, by an application of Freedman's inequality (see, e.g., Theorem 1 of Beygelzimer et al. [7]), we can obtain the bound

LU 

t—1 t—1 

that holds with probability at least $1 - \delta'$ for all $\omega \le 1/K$. Combining this result with the previous bounds and using the union bound, we arrive at the bound


.log K log (K/5') log(l/5') /T) 


2 7 


Rt ^——-1- ‘S ' —- + 

that holds with probability at least $1 - 4\delta'$.

Invoking Lemma 1 of Kocák et al. that states that

K 


(1 + 7HX>+I 


V K\og(K/5’) + /Tlog(l/5') 


t= 1 


2 7 


——— < 2cr log ( 1 

^ <H,i + 7 " V 


\K 2 /j] +K 


i =1 


a 


+ 2 


holds almost surely and setting S’ = 6/ 4, we obtain the bound 


Rt + lQ g( 4A ^ + lQ g( 4 /^ + („ + 2 7 + 2w) a'T + — ■ Klo ^ K / 5 ) + , / T1 °g( 4 /<5) 

ry 27 w 2 27 


that holds with probability at least 1 — 5, where a' = a log ^1 

IogA and u = 


\k 2 /^+k 


1. 


Now notice that when setting r/ = 2j = y o a Tiog(KT) 
2a log (KT) and the above bound becomes 


log(4/(?) 
2aT log(ifT) 


, we have a' < 


Rt < (4 + 2 v/log(4 Jsfj ■ -\J 2aT (log 2 K + log AT) + log (4/5) + 

/Tlog(4/5) K log {AK/8) 

+ V-2- + -2-■ 

The proof is concluded by observing that the last term is bounded by the third one if $T \ge K^2/(8\alpha)$.